Overview

Dataset statistics

Number of variables: 13
Number of observations: 6497
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 1177
Duplicate rows (%): 18.1%
Total size in memory: 660.0 KiB
Average record size in memory: 104.0 B
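The percentage and per-record figures above follow from the raw counts. A minimal sketch of that arithmetic, assuming 1 KiB = 1024 bytes; the variable names are illustrative and not part of the report:

```python
# Figures copied from the "Dataset statistics" block above.
n_rows = 6497
n_duplicates = 1177
total_kib = 660.0

# Share of rows flagged as duplicates, in percent.
duplicate_pct = 100 * n_duplicates / n_rows

# Average bytes per record, assuming 1 KiB = 1024 bytes.
avg_record_bytes = total_kib * 1024 / n_rows

print(f"Duplicate rows (%): {duplicate_pct:.1f}%")
print(f"Average record size: {avg_record_bytes:.1f} B")
```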

Variable types

NUM: 11
CAT: 2

Warnings

Duplicates: Dataset has 1177 (18.1%) duplicate rows
Zeros: citric acid has 151 (2.3%) zeros

Reproduction

Analysis started: 2020-11-13 07:41:20.158525
Analysis finished: 2020-11-13 07:41:39.486287
Duration: 19.33 seconds
Software version: pandas-profiling v2.9.0
Download configuration: config.yaml
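A report of this kind could be regenerated along the following lines. This is a sketch only: it assumes pandas and pandas-profiling 2.x are installed, and the function name, CSV path, and title are hypothetical, since the report does not name its input file.

```python
def build_report(csv_path, output_path="report.html", config_path=None):
    """Profile a CSV file and write an HTML report (illustrative sketch)."""
    # Imported lazily so this definition loads even where the libraries are absent.
    import pandas as pd
    from pandas_profiling import ProfileReport

    df = pd.read_csv(csv_path)
    kwargs = {"title": "Wine quality profile"}  # hypothetical title
    if config_path is not None:
        # Reuse a saved configuration such as the downloadable config.yaml.
        kwargs["config_file"] = config_path
    profile = ProfileReport(df, **kwargs)
    profile.to_file(output_path)
    return profile
```

Calling build_report("wine.csv", config_path="config.yaml") would write report.html; "wine.csv" is a placeholder file name.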

Variables

wine_type
Categorical

Distinct: 2
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 50.8 KiB

Value | Count | Frequency (%)
white wine | 4898 | 75.4%
red wine | 1599 | 24.6%
Frequencies of value counts

Unique

Unique: 0
Unique (%): 0.0%
Histogram of lengths of the category

Length

Max length: 10
Median length: 10
Mean length: 9.507772818
Min length: 8
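The length statistics describe the character lengths of the category labels, weighted by how often each label occurs. A small sketch that reproduces the mean from the value counts above; the variable names are illustrative:

```python
# Value counts copied from the wine_type frequency table above.
counts = {"white wine": 4898, "red wine": 1599}

total = sum(counts.values())

# Each label contributes its character length once per occurrence.
mean_length = sum(len(label) * n for label, n in counts.items()) / total
min_length = min(len(label) for label in counts)
max_length = max(len(label) for label in counts)

print(min_length, max_length, mean_length)
```

The median length is 10 because more than half of the rows carry the 10-character label "white wine".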

citric acid
Real number (ℝ≥0)

ZEROS

Distinct: 89
Distinct (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.3186332153
Minimum: 0
Maximum: 1.66
Zeros: 151
Zeros (%): 2.3%
Memory size: 50.8 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0.05
Q1: 0.25
median: 0.31
Q3: 0.39
95-th percentile: 0.56
Maximum: 1.66
Range: 1.66
Interquartile range (IQR): 0.14

Descriptive statistics

Standard deviation: 0.1453178649
Coefficient of variation (CV): 0.4560662791
Kurtosis: 2.397239216
Mean: 0.3186332153
Median Absolute Deviation (MAD): 0.07
Skewness: 0.4717306725
Sum: 2070.16
Variance: 0.02111728186
Monotonicity: Not monotonic
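Several of the figures above are derived from others: the coefficient of variation is the standard deviation divided by the mean, the variance is the squared standard deviation, and the range and IQR are differences of quantiles. A short sketch that checks these relations using the citric acid values reported above (variable names are illustrative):

```python
import math

# Values copied from the citric acid statistics above.
mean = 0.3186332153
std = 0.1453178649
q1, q3 = 0.25, 0.39
minimum, maximum = 0.0, 1.66

cv = std / mean                  # coefficient of variation
variance = std ** 2              # variance is the squared standard deviation
iqr = q3 - q1                    # interquartile range
value_range = maximum - minimum  # range

print(f"CV={cv:.10f} variance={variance:.11f} IQR={iqr:.2f} range={value_range:.2f}")
```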
Histogram with fixed size bins (bins=50)
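Common values

Value | Count | Frequency (%)
0.3 | 337 | 5.2%
0.28 | 301 | 4.6%
0.32 | 289 | 4.4%
0.49 | 283 | 4.4%
0.26 | 257 | 4.0%
0.34 | 249 | 3.8%
0.29 | 244 | 3.8%
0.27 | 236 | 3.6%
0.24 | 232 | 3.6%
0.31 | 230 | 3.5%
Other values (79) | 3839 | 59.1%

Minimum 5 values

Value | Count | Frequency (%)
0 | 151 | 2.3%
0.01 | 40 | 0.6%
0.02 | 56 | 0.9%
0.03 | 32 | 0.5%
0.04 | 41 | 0.6%

Maximum 5 values

Value | Count | Frequency (%)
1.66 | 1 | < 0.1%
1.23 | 1 | < 0.1%
1 | 6 | 0.1%
0.99 | 1 | < 0.1%
0.91 | 2 | < 0.1%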

fixed acidity
Real number (ℝ≥0)

Distinct: 106
Distinct (%): 1.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 7.215307065
Minimum: 3.8
Maximum: 15.9
Zeros: 0
Zeros (%): 0.0%
Memory size: 50.8 KiB

Quantile statistics

Minimum: 3.8
5-th percentile: 5.7
Q1: 6.4
median: 7
Q3: 7.7
95-th percentile: 9.8
Maximum: 15.9
Range: 12.1
Interquartile range (IQR): 1.3

Descriptive statistics

Standard deviation: 1.296433758
Coefficient of variation (CV): 0.1796782516
Kurtosis: 5.061160665
Mean: 7.215307065
Median Absolute Deviation (MAD): 0.6
Skewness: 1.723289647
Sum: 46877.85
Variance: 1.680740488
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
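Common values

Value | Count | Frequency (%)
6.8 | 354 | 5.4%
6.6 | 327 | 5.0%
6.4 | 305 | 4.7%
7 | 282 | 4.3%
6.9 | 279 | 4.3%
7.2 | 273 | 4.2%
6.7 | 264 | 4.1%
7.1 | 257 | 4.0%
6.5 | 242 | 3.7%
7.4 | 238 | 3.7%
Other values (96) | 3676 | 56.6%

Minimum 5 values

Value | Count | Frequency (%)
3.8 | 1 | < 0.1%
3.9 | 1 | < 0.1%
4.2 | 2 | < 0.1%
4.4 | 3 | < 0.1%
4.5 | 1 | < 0.1%

Maximum 5 values

Value | Count | Frequency (%)
15.9 | 1 | < 0.1%
15.6 | 2 | < 0.1%
15.5 | 2 | < 0.1%
15 | 2 | < 0.1%
14.3 | 1 | < 0.1%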

volatile acidity
Real number (ℝ≥0)

Distinct: 187
Distinct (%): 2.9%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.3396659997
Minimum: 0.08
Maximum: 1.58
Zeros: 0
Zeros (%): 0.0%
Memory size: 50.8 KiB

Quantile statistics

Minimum: 0.08
5-th percentile: 0.16
Q1: 0.23
median: 0.29
Q3: 0.4
95-th percentile: 0.67
Maximum: 1.58
Range: 1.5
Interquartile range (IQR): 0.17

Descriptive statistics

Standard deviation: 0.1646364741
Coefficient of variation (CV): 0.4847010717
Kurtosis: 2.825372417
Mean: 0.3396659997
Median Absolute Deviation (MAD): 0.08
Skewness: 1.495096542
Sum: 2206.81
Variance: 0.0271051686
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
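Common values

Value | Count | Frequency (%)
0.28 | 286 | 4.4%
0.24 | 266 | 4.1%
0.26 | 256 | 3.9%
0.25 | 238 | 3.7%
0.22 | 235 | 3.6%
0.27 | 232 | 3.6%
0.23 | 221 | 3.4%
0.2 | 217 | 3.3%
0.3 | 214 | 3.3%
0.32 | 205 | 3.2%
Other values (177) | 4127 | 63.5%

Minimum 5 values

Value | Count | Frequency (%)
0.08 | 4 | 0.1%
0.085 | 1 | < 0.1%
0.09 | 1 | < 0.1%
0.1 | 6 | 0.1%
0.105 | 6 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
1.58 | 1 | < 0.1%
1.33 | 2 | < 0.1%
1.24 | 1 | < 0.1%
1.185 | 1 | < 0.1%
1.18 | 1 | < 0.1%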

residual sugar
Real number (ℝ≥0)

Distinct: 316
Distinct (%): 4.9%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 5.443235339
Minimum: 0.6
Maximum: 65.8
Zeros: 0
Zeros (%): 0.0%
Memory size: 50.8 KiB

Quantile statistics

Minimum: 0.6
5-th percentile: 1.2
Q1: 1.8
median: 3
Q3: 8.1
95-th percentile: 15
Maximum: 65.8
Range: 65.2
Interquartile range (IQR): 6.3

Descriptive statistics

Standard deviation: 4.757803743
Coefficient of variation (CV): 0.8740764355
Kurtosis: 4.359271948
Mean: 5.443235339
Median Absolute Deviation (MAD): 1.7
Skewness: 1.435404263
Sum: 35364.7
Variance: 22.63669646
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
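Common values

Value | Count | Frequency (%)
2 | 235 | 3.6%
1.8 | 228 | 3.5%
1.6 | 223 | 3.4%
1.4 | 219 | 3.4%
1.2 | 195 | 3.0%
2.2 | 187 | 2.9%
2.1 | 179 | 2.8%
1.9 | 176 | 2.7%
1.7 | 175 | 2.7%
1.5 | 172 | 2.6%
Other values (306) | 4508 | 69.4%

Minimum 5 values

Value | Count | Frequency (%)
0.6 | 2 | < 0.1%
0.7 | 7 | 0.1%
0.8 | 25 | 0.4%
0.9 | 41 | 0.6%
0.95 | 4 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
65.8 | 1 | < 0.1%
31.6 | 2 | < 0.1%
26.05 | 2 | < 0.1%
23.5 | 1 | < 0.1%
22.6 | 1 | < 0.1%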

chlorides
Real number (ℝ≥0)

Distinct: 214
Distinct (%): 3.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.05603386178
Minimum: 0.009
Maximum: 0.611
Zeros: 0
Zeros (%): 0.0%
Memory size: 50.8 KiB

Quantile statistics

Minimum: 0.009
5-th percentile: 0.028
Q1: 0.038
median: 0.047
Q3: 0.065
95-th percentile: 0.102
Maximum: 0.611
Range: 0.602
Interquartile range (IQR): 0.027

Descriptive statistics

Standard deviation: 0.03503360137
Coefficient of variation (CV): 0.6252219686
Kurtosis: 50.89805146
Mean: 0.05603386178
Median Absolute Deviation (MAD): 0.011
Skewness: 5.399827732
Sum: 364.052
Variance: 0.001227353225
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
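Common values

Value | Count | Frequency (%)
0.044 | 206 | 3.2%
0.036 | 200 | 3.1%
0.042 | 187 | 2.9%
0.046 | 185 | 2.8%
0.048 | 182 | 2.8%
0.04 | 182 | 2.8%
0.05 | 182 | 2.8%
0.047 | 175 | 2.7%
0.045 | 174 | 2.7%
0.038 | 169 | 2.6%
Other values (204) | 4655 | 71.6%

Minimum 5 values

Value | Count | Frequency (%)
0.009 | 1 | < 0.1%
0.012 | 3 | < 0.1%
0.013 | 1 | < 0.1%
0.014 | 4 | 0.1%
0.015 | 4 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
0.611 | 1 | < 0.1%
0.61 | 1 | < 0.1%
0.467 | 1 | < 0.1%
0.464 | 1 | < 0.1%
0.422 | 1 | < 0.1%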

free sulfur dioxide
Real number (ℝ≥0)

Distinct: 135
Distinct (%): 2.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 30.52531938
Minimum: 1
Maximum: 289
Zeros: 0
Zeros (%): 0.0%
Memory size: 50.8 KiB

Quantile statistics

Minimum: 1
5-th percentile: 6
Q1: 17
median: 29
Q3: 41
95-th percentile: 61
Maximum: 289
Range: 288
Interquartile range (IQR): 24

Descriptive statistics

Standard deviation: 17.74939977
Coefficient of variation (CV): 0.5814648342
Kurtosis: 7.906238067
Mean: 30.52531938
Median Absolute Deviation (MAD): 12
Skewness: 1.220066074
Sum: 198323
Variance: 315.0411923
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
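Common values

Value | Count | Frequency (%)
29 | 183 | 2.8%
6 | 170 | 2.6%
26 | 161 | 2.5%
15 | 157 | 2.4%
24 | 152 | 2.3%
31 | 152 | 2.3%
17 | 149 | 2.3%
34 | 146 | 2.2%
35 | 144 | 2.2%
23 | 142 | 2.2%
Other values (125) | 4941 | 76.1%

Minimum 5 values

Value | Count | Frequency (%)
1 | 3 | < 0.1%
2 | 2 | < 0.1%
3 | 59 | 0.9%
4 | 52 | 0.8%
5 | 129 | 2.0%

Maximum 5 values

Value | Count | Frequency (%)
289 | 1 | < 0.1%
146.5 | 1 | < 0.1%
138.5 | 1 | < 0.1%
131 | 1 | < 0.1%
128 | 1 | < 0.1%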

total sulfur dioxide
Real number (ℝ≥0)

Distinct: 276
Distinct (%): 4.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 115.7445744
Minimum: 6
Maximum: 440
Zeros: 0
Zeros (%): 0.0%
Memory size: 50.8 KiB

Quantile statistics

Minimum: 6
5-th percentile: 19
Q1: 77
median: 118
Q3: 156
95-th percentile: 206
Maximum: 440
Range: 434
Interquartile range (IQR): 79

Descriptive statistics

Standard deviation: 56.52185452
Coefficient of variation (CV): 0.488332648
Kurtosis: -0.3716636549
Mean: 115.7445744
Median Absolute Deviation (MAD): 39
Skewness: -0.001177478234
Sum: 751992.5
Variance: 3194.720039
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
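Common values

Value | Count | Frequency (%)
111 | 72 | 1.1%
113 | 65 | 1.0%
122 | 57 | 0.9%
117 | 57 | 0.9%
128 | 56 | 0.9%
124 | 56 | 0.9%
98 | 56 | 0.9%
114 | 56 | 0.9%
118 | 55 | 0.8%
150 | 54 | 0.8%
Other values (266) | 5913 | 91.0%

Minimum 5 values

Value | Count | Frequency (%)
6 | 3 | < 0.1%
7 | 4 | 0.1%
8 | 14 | 0.2%
9 | 15 | 0.2%
10 | 28 | 0.4%

Maximum 5 values

Value | Count | Frequency (%)
440 | 1 | < 0.1%
366.5 | 1 | < 0.1%
344 | 1 | < 0.1%
313 | 1 | < 0.1%
307.5 | 1 | < 0.1%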

density
Real number (ℝ≥0)

Distinct: 998
Distinct (%): 15.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.9946966338
Minimum: 0.98711
Maximum: 1.03898
Zeros: 0
Zeros (%): 0.0%
Memory size: 50.8 KiB

Quantile statistics

Minimum: 0.98711
5-th percentile: 0.9899
Q1: 0.99234
median: 0.99489
Q3: 0.99699
95-th percentile: 0.999392
Maximum: 1.03898
Range: 0.05187
Interquartile range (IQR): 0.00465

Descriptive statistics

Standard deviation: 0.002998673004
Coefficient of variation (CV): 0.003014660854
Kurtosis: 6.606066991
Mean: 0.9946966338
Median Absolute Deviation (MAD): 0.00231
Skewness: 0.5036017301
Sum: 6462.54403
Variance: 8.992039783e-06
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
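Common values

Value | Count | Frequency (%)
0.9972 | 69 | 1.1%
0.9976 | 69 | 1.1%
0.992 | 64 | 1.0%
0.998 | 64 | 1.0%
0.9928 | 63 | 1.0%
0.9986 | 61 | 0.9%
0.9966 | 59 | 0.9%
0.9962 | 59 | 0.9%
0.9968 | 55 | 0.8%
0.9956 | 55 | 0.8%
Other values (988) | 5879 | 90.5%

Minimum 5 values

Value | Count | Frequency (%)
0.98711 | 1 | < 0.1%
0.98713 | 1 | < 0.1%
0.98722 | 1 | < 0.1%
0.9874 | 1 | < 0.1%
0.98742 | 2 | < 0.1%

Maximum 5 values

Value | Count | Frequency (%)
1.03898 | 1 | < 0.1%
1.0103 | 2 | < 0.1%
1.00369 | 2 | < 0.1%
1.0032 | 1 | < 0.1%
1.00315 | 3 | < 0.1%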

pH
Real number (ℝ≥0)

Distinct: 108
Distinct (%): 1.7%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.218500847
Minimum: 2.72
Maximum: 4.01
Zeros: 0
Zeros (%): 0.0%
Memory size: 50.8 KiB

Quantile statistics

Minimum: 2.72
5-th percentile: 2.97
Q1: 3.11
median: 3.21
Q3: 3.32
95-th percentile: 3.5
Maximum: 4.01
Range: 1.29
Interquartile range (IQR): 0.21

Descriptive statistics

Standard deviation: 0.1607872021
Coefficient of variation (CV): 0.04995717254
Kurtosis: 0.3676572674
Mean: 3.218500847
Median Absolute Deviation (MAD): 0.11
Skewness: 0.3868387981
Sum: 20910.6
Variance: 0.02585252436
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
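Common values

Value | Count | Frequency (%)
3.16 | 200 | 3.1%
3.14 | 193 | 3.0%
3.22 | 185 | 2.8%
3.2 | 176 | 2.7%
3.19 | 170 | 2.6%
3.15 | 170 | 2.6%
3.18 | 168 | 2.6%
3.24 | 161 | 2.5%
3.12 | 154 | 2.4%
3.1 | 154 | 2.4%
Other values (98) | 4766 | 73.4%

Minimum 5 values

Value | Count | Frequency (%)
2.72 | 1 | < 0.1%
2.74 | 2 | < 0.1%
2.77 | 1 | < 0.1%
2.79 | 3 | < 0.1%
2.8 | 3 | < 0.1%

Maximum 5 values

Value | Count | Frequency (%)
4.01 | 2 | < 0.1%
3.9 | 2 | < 0.1%
3.85 | 1 | < 0.1%
3.82 | 1 | < 0.1%
3.81 | 1 | < 0.1%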

alcohol
Real number (ℝ≥0)

Distinct: 111
Distinct (%): 1.7%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 10.49180083
Minimum: 8
Maximum: 14.9
Zeros: 0
Zeros (%): 0.0%
Memory size: 50.8 KiB

Quantile statistics

Minimum: 8
5-th percentile: 9
Q1: 9.5
median: 10.3
Q3: 11.3
95-th percentile: 12.7
Maximum: 14.9
Range: 6.9
Interquartile range (IQR): 1.8

Descriptive statistics

Standard deviation: 1.192711749
Coefficient of variation (CV): 0.1136803651
Kurtosis: -0.5316873829
Mean: 10.49180083
Median Absolute Deviation (MAD): 0.9
Skewness: 0.5657177291
Sum: 68165.23
Variance: 1.422561316
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
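Common values

Value | Count | Frequency (%)
9.5 | 367 | 5.6%
9.4 | 332 | 5.1%
9.2 | 271 | 4.2%
10 | 229 | 3.5%
10.5 | 227 | 3.5%
11 | 217 | 3.3%
9 | 215 | 3.3%
9.8 | 214 | 3.3%
10.4 | 194 | 3.0%
9.3 | 193 | 3.0%
Other values (101) | 4038 | 62.2%

Minimum 5 values

Value | Count | Frequency (%)
8 | 2 | < 0.1%
8.4 | 5 | 0.1%
8.5 | 10 | 0.2%
8.6 | 23 | 0.4%
8.7 | 80 | 1.2%

Maximum 5 values

Value | Count | Frequency (%)
14.9 | 1 | < 0.1%
14.2 | 1 | < 0.1%
14.05 | 1 | < 0.1%
14 | 12 | 0.2%
13.9 | 3 | < 0.1%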

quality
Real number (ℝ≥0)

Distinct: 7
Distinct (%): 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 5.818377713
Minimum: 3
Maximum: 9
Zeros: 0
Zeros (%): 0.0%
Memory size: 50.8 KiB

Quantile statistics

Minimum: 3
5-th percentile: 5
Q1: 5
median: 6
Q3: 6
95-th percentile: 7
Maximum: 9
Range: 6
Interquartile range (IQR): 1

Descriptive statistics

Standard deviation: 0.8732552715
Coefficient of variation (CV): 0.1500856965
Kurtosis: 0.2323222693
Mean: 5.818377713
Median Absolute Deviation (MAD): 1
Skewness: 0.1896226934
Sum: 37802
Variance: 0.7625747693
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=7)
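Common values

Value | Count | Frequency (%)
6 | 2836 | 43.7%
5 | 2138 | 32.9%
7 | 1079 | 16.6%
4 | 216 | 3.3%
8 | 193 | 3.0%
3 | 30 | 0.5%
9 | 5 | 0.1%

Minimum 5 values

Value | Count | Frequency (%)
3 | 30 | 0.5%
4 | 216 | 3.3%
5 | 2138 | 32.9%
6 | 2836 | 43.7%
7 | 1079 | 16.6%

Maximum 5 values

Value | Count | Frequency (%)
9 | 5 | 0.1%
8 | 193 | 3.0%
7 | 1079 | 16.6%
6 | 2836 | 43.7%
5 | 2138 | 32.9%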

quality_label
Categorical

Distinct: 3
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 50.8 KiB

Value | Count | Frequency (%)
medium quality | 3915 | 60.3%
low quality | 2384 | 36.7%
high quality | 198 | 3.0%
Frequencies of value counts

Unique

Unique: 0
Unique (%): 0.0%
Histogram of lengths of the category

Length

Max length: 14
Median length: 14
Mean length: 12.83823303
Min length: 11
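The report does not state how quality_label was derived from quality. The sketch below uses thresholds that are an inference, not a documented rule: scores up to 5 are "low quality", 6 to 7 "medium quality", and 8 or above "high quality". With the quality value counts reported earlier, this mapping reproduces the quality_label counts exactly; the function name is hypothetical.

```python
# Value counts copied from the quality variable's frequency table.
quality_counts = {3: 30, 4: 216, 5: 2138, 6: 2836, 7: 1079, 8: 193, 9: 5}

def quality_label(score):
    """Map a quality score to a label (thresholds inferred, not documented)."""
    if score <= 5:
        return "low quality"
    if score <= 7:
        return "medium quality"
    return "high quality"

# Aggregate the per-score counts into per-label counts.
label_counts = {}
for score, n in quality_counts.items():
    label = quality_label(score)
    label_counts[label] = label_counts.get(label, 0) + n

print(label_counts)
```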

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and (positive) scale of the two variables, so for a linear relationship the steepness of the line does not affect r; only the direction of the slope determines its sign.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
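The calculation described above can be sketched in plain Python. The data values are made up for illustration and do not come from the wine dataset; the function name is likewise illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson's r: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of the covariance and standard deviations cancel,
    # so raw sums of products and squares are sufficient.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Made-up illustrative values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

r = pearson_r(x, y)
# Shifting and rescaling either variable by a positive factor leaves r unchanged.
r_rescaled = pearson_r([10 * a + 3 for a in x], [0.5 * b - 7 for b in y])
print(r, r_rescaled)
```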

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at capturing nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of the standard deviations of those rank variables. In other words, ρ is Pearson's r computed on ranks.
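A minimal sketch of this rank-based calculation, assuming tied values receive the average of the ranks they span (a common convention, not confirmed by the report). The example data are made up: y is a monotonic but nonlinear function of x, so ρ reaches 1 while Pearson's r does not.

```python
import math

def pearson_r(x, y):
    """Pearson's r from raw sums (the 1/n factors cancel)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the rank variables."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Made-up illustrative values: a monotonic, nonlinear relationship.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(spearman_rho(x, y), pearson_r(x, y))
```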

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
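The pair-counting definition above corresponds to the variant often called tau-a. The sketch below implements exactly that definition on made-up data; library implementations commonly apply a tie-adjusted variant (tau-b), so results can differ when the data contain ties.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant - discordant) / total pairs, as described in the text."""
    concordant = discordant = 0
    pairs = list(combinations(range(len(x)), 2))
    for i, j in pairs:
        # A pair is concordant when x and y move in the same direction.
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        # Ties (direction == 0) count as neither.
    return (concordant - discordant) / len(pairs)

# Made-up illustrative values: one swapped pair out of six.
tau = kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4])
print(tau)
```

With four observations there are six pairs; one is discordant and five are concordant, giving (5 - 1) / 6.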

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use a bias-corrected measure proposed by Bergsma in 2013.
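A sketch of a bias-corrected Cramér's V computed from a contingency table, following the correction commonly attributed to Bergsma (2013). This is not taken from the report's source code; the function name and the example tables are made up, and the sketch assumes every row and column total is nonzero.

```python
import math

def cramers_v_corrected(table):
    """Bias-corrected Cramér's V for a contingency table (list of rows)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    r, k = len(row_totals), len(col_totals)

    # Pearson chi-squared statistic against the independence expectation.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    phi2 = chi2 / n
    # Bias correction: shrink phi^2 and the table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    k_corr = k - (k - 1) ** 2 / (n - 1)
    r_corr = r - (r - 1) ** 2 / (n - 1)
    return math.sqrt(phi2_corr / min(k_corr - 1, r_corr - 1))

# Made-up tables: perfect association and exact independence.
v_perfect = cramers_v_corrected([[50, 0], [0, 50]])
v_independent = cramers_v_corrected([[25, 25], [25, 25]])
print(v_perfect, v_independent)
```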

Missing values

Sample

First rows

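index | wine_type | citric acid | fixed acidity | volatile acidity | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | alcohol | quality | quality_label
0 | red wine | 0.22 | 6.8 | 0.56 | 1.8 | 0.074 | 15.0 | 24.0 | 0.99438 | 3.40 | 11.2 | 6 | medium quality
1 | white wine | 0.36 | 6.4 | 0.30 | 2.0 | 0.052 | 18.0 | 141.0 | 0.99273 | 3.38 | 10.5 | 6 | medium quality
2 | white wine | 0.29 | 5.9 | 0.17 | 3.1 | 0.030 | 32.0 | 123.0 | 0.98913 | 3.41 | 13.7 | 7 | medium quality
3 | white wine | 0.24 | 7.0 | 0.24 | 1.8 | 0.047 | 29.0 | 91.0 | 0.99251 | 3.30 | 9.9 | 6 | medium quality
4 | white wine | 0.07 | 6.4 | 0.45 | 1.1 | 0.030 | 10.0 | 131.0 | 0.99050 | 2.97 | 10.8 | 5 | low quality
5 | white wine | 0.26 | 6.0 | 0.20 | 6.8 | 0.049 | 22.0 | 93.0 | 0.99280 | 3.15 | 11.0 | 6 | medium quality
6 | white wine | 0.49 | 6.8 | 0.22 | 0.9 | 0.052 | 26.0 | 128.0 | 0.99100 | 3.25 | 11.4 | 6 | medium quality
7 | white wine | 0.61 | 7.1 | 0.43 | 11.8 | 0.045 | 54.0 | 155.0 | 0.99740 | 3.11 | 8.7 | 5 | low quality
8 | white wine | 0.45 | 6.2 | 0.36 | 10.4 | 0.060 | 22.0 | 184.0 | 0.99711 | 3.31 | 9.8 | 6 | medium quality
9 | red wine | 0.66 | 9.5 | 0.55 | 2.3 | 0.387 | 12.0 | 37.0 | 0.99820 | 3.17 | 9.6 | 5 | low quality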

Last rows

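index | wine_type | citric acid | fixed acidity | volatile acidity | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | alcohol | quality | quality_label
6487 | white wine | 0.16 | 6.5 | 0.36 | 1.3 | 0.054 | 11.0 | 107.0 | 0.99398 | 3.19 | 8.5 | 5 | low quality
6488 | red wine | 0.44 | 8.5 | 0.34 | 1.7 | 0.079 | 6.0 | 12.0 | 0.99605 | 3.52 | 10.7 | 5 | low quality
6489 | white wine | 0.30 | 6.8 | 0.32 | 1.0 | 0.049 | 22.0 | 113.0 | 0.99289 | 3.24 | 10.2 | 5 | low quality
6490 | white wine | 0.28 | 7.0 | 0.14 | 1.3 | 0.026 | 10.0 | 56.0 | 0.99352 | 3.46 | 9.9 | 5 | low quality
6491 | white wine | 0.37 | 6.9 | 0.28 | 9.1 | 0.037 | 16.0 | 76.0 | 0.99480 | 3.05 | 11.1 | 5 | low quality
6492 | red wine | 0.45 | 12.7 | 0.59 | 2.3 | 0.082 | 11.0 | 22.0 | 1.00000 | 3.00 | 9.3 | 6 | medium quality
6493 | white wine | 0.35 | 8.0 | 0.25 | 1.1 | 0.054 | 13.0 | 136.0 | 0.99366 | 3.08 | 9.5 | 5 | low quality
6494 | red wine | 0.14 | 8.3 | 0.85 | 2.5 | 0.093 | 13.0 | 54.0 | 0.99724 | 3.36 | 10.1 | 5 | low quality
6495 | red wine | 0.10 | 6.3 | 0.60 | 1.6 | 0.048 | 12.0 | 26.0 | 0.99306 | 3.55 | 12.1 | 5 | low quality
6496 | white wine | 0.49 | 7.1 | 0.18 | 1.3 | 0.033 | 12.0 | 72.0 | 0.99072 | 3.05 | 11.3 | 7 | medium quality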

Duplicate rows

Most frequent

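index | wine_type | citric acid | fixed acidity | volatile acidity | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | alcohol | quality | quality_label | count
438 | white wine | 0.27 | 7.3 | 0.19 | 13.9 | 0.057 | 45.0 | 155.0 | 0.99807 | 2.94 | 8.8 | 8 | high quality | 8
473 | white wine | 0.28 | 7.0 | 0.15 | 14.7 | 0.051 | 29.0 | 149.0 | 0.99792 | 2.96 | 9.0 | 7 | medium quality | 8
549 | white wine | 0.30 | 6.8 | 0.18 | 12.8 | 0.062 | 19.0 | 171.0 | 0.99808 | 3.00 | 9.0 | 7 | medium quality | 7
565 | white wine | 0.30 | 7.4 | 0.16 | 13.7 | 0.056 | 33.0 | 168.0 | 0.99825 | 2.90 | 8.7 | 7 | medium quality | 7
439 | white wine | 0.27 | 7.4 | 0.16 | 15.5 | 0.050 | 25.0 | 135.0 | 0.99840 | 2.90 | 8.7 | 7 | medium quality | 6
566 | white wine | 0.30 | 7.4 | 0.19 | 12.8 | 0.053 | 48.5 | 229.0 | 0.99860 | 3.14 | 9.1 | 7 | medium quality | 6
567 | white wine | 0.30 | 7.6 | 0.20 | 14.2 | 0.056 | 53.0 | 212.5 | 0.99900 | 3.14 | 8.9 | 8 | high quality | 6
598 | white wine | 0.31 | 7.4 | 0.19 | 14.5 | 0.045 | 39.0 | 193.0 | 0.99860 | 3.10 | 9.2 | 6 | medium quality | 6
273 | white wine | 0.20 | 5.7 | 0.22 | 16.0 | 0.044 | 41.0 | 113.0 | 0.99862 | 3.22 | 8.9 | 6 | medium quality | 5
318 | white wine | 0.23 | 6.6 | 0.22 | 17.3 | 0.047 | 37.0 | 118.0 | 0.99906 | 3.08 | 8.8 | 6 | medium quality | 5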